Robust Differential Abundance Analysis of Microbiome Sequencing Data

It is well known that the microbiome data are ridden with outliers and have heavy distribution tails, but the impact of outliers and heavy-tailedness has yet to be examined systematically. This paper investigates the impact of outliers and heavy-tailedness on differential abundance analysis (DAA) using the linear models for the differential abundance analysis (LinDA) method and proposes effective strategies to mitigate their influence. The presence of outliers and heavy-tailedness can significantly decrease the power of LinDA. We investigate various techniques to address outliers and heavy-tailedness, including generalizing LinDA into a more flexible framework that allows for the use of robust regression and winsorizing the data before applying LinDA. Our extensive numerical experiments and real-data analyses demonstrate that robust Huber regression has overall the best performance in addressing outliers and heavy-tailedness.


Introduction
The human microbiome is a complex and multifaceted ecosystem comprising diverse microorganisms, including bacteria, viruses, and fungi.These microorganisms inhabit different parts of the human body and play a crucial role in various biological functions essential for human health and disease prevention [1,2].Understanding the diversity and abundance of microorganisms in the microbiome, such as the gut microbiome, is crucial for identifying potential pathogens that can cause harm or probiotics that promote good health [3,4].
Metagenomic sequencing is the primary method used to study the microbiome.It provides a comprehensive snapshot of the microbiome's composition by sequencing the genetic material of all microorganisms in a sample [5,6].However, this technique only provides relative abundance data, with the abundance of each microorganism expressed as a proportion of the total number of microorganisms in the sample [7].Absolute abundance measurement can be achieved through various experimental techniques like qPCR, spike-in, and flow cytometry.However, these techniques have yet to be widely adopted due to their severe limitations [8].Therefore, the prevailing sequencing protocol is still only capable of measuring the relative abundances.Nevertheless, when combined with appropriate statistical methods, relative abundance data can still provide valuable insights into the composition and function of the microbiome and its impact on health and disease.Following the processing of sequence reads using a bioinformatic pipeline, such as DADA2 [9] for 16S-targeted sequencing and MetaPhlAn2 [6] for shotgun metagenomic data, an abundance table that records the frequencies of the detected microbial taxa is generated.This table is used for downstream statistical analyses along with metadata that captures sample-level characteristics.
Differential abundance analysis (DAA) is a central downstream task that seeks to identify microbial taxa whose abundance correlates with a variable of interest.However, classical statistical tools, such as ANOVA and rank-based tests, are unsuitable for DAA as they do not account for the compositional nature of microbiome data and can lead to significant false discoveries.This is due to changes in one microorganism's abundance that can influence others' relative abundance.Several differential analysis methods have been developed that consider compositional effects [10] to address this issue.These methods use either robust normalization techniques, such as TMM, RLE, CSS, and GMPR [11][12][13][14] or (log) ratio-based approaches, including ALDEx2 [15], ANCOM-BC [16], MaAsLin2 [17], and LinDA [18].
Identifying differentially abundant taxa in real-world data can be challenging due to the presence of outliers (overly abundant taxa) that may negatively impact the effectiveness of existing DAA methods [19][20][21], where outliers mean an extremely high abundance of a particular taxon in a few samples.Despite their widespread use, there has been a need for more systematic investigations into the influence of outliers on DAA methods.Additionally, log-ratio-based approaches (e.g., LinDA, ANCOM-BC, MaAsLin2) rely on normality assumption for random errors, which may not be valid for real-world data.Our observations suggest that this assumption can be violated in real-world data, leading to a phenomenon we term heavy-tailedness, which can impair the performance of existing DAA methods.Here, heavy-tailedness means that the tail of the distribution of random errors is heavier than the normal distribution.
This study aims to investigate the impact of outliers and heavy-tailedness on differential abundance analysis (DAA) methods and identify effective strategies to mitigate these challenges.LinDA was used as the benchmark DAA method, and the primary aim was to identify the most effective strategy to address outliers and heavy-tailedness in DAA.The study revealed that the presence of outliers or heavy-tailedness significantly reduces the power of detecting differential taxa.To address these challenges, we proposed a general M-estimation framework for DAA, which encompasses differential analysis based on Huber regression as a special case.Huber regression is a widely used statistical method that guards against outliers and heavy-tailedness in regression problems [22].Additionally, this study explored the effectiveness of the winsorization method, a statistical data preprocessing technique in combination with LinDA, for handling noisy data.Winsorization replaces extreme values with less extreme ones and has been shown to effectively handle outliers and heavy-tailedness in data analysis [23].This study's simulations demonstrated that Huber regression exhibits superior robustness against outliers and heavy-tailedness compared to the LinDA and LinDA methods with winsorization.Therefore, this study recommends using the Huber method in instances wherein the dataset is subject to noise.
We summarize our main contributions as follows: 1.
This study conducted comprehensive simulations to investigate the impact of outliers and different types of heavy-tailedness on the various DAA methods.

2.
This study introduced a general M-estimation framework for DAA, which includes several methods that are robust to outliers and heavy-tailedness.

3.
This study conducted extensive simulations to examine the performance of various DAA methods to address outliers and heavy-tailedness, including multiple levels of the winsorization technique and different M-estimation-based methods.The experiments demonstrated that the proposed Huber regression method based on the M-estimation framework is more stable for outliers and heavy-tailedness.
The remainder of this paper is organized as follows.In Section 2.1, we introduce the regression-based framework for DAA, which includes LinDA and the Huber regressionbased method as special cases.Section 2.2 provides an overview of the winsorization method and its implementation details.Our simulation studies are presented in Section 3. Lastly, Section 4 presents some real data analyses.Additional numerical results can be found in the Appendixes A-C.

A Regression-Based Framework for Differential Analysis
In this section, we present a generalization of the LinDA method, utilizing a Mestimation framework for DAA based on the central log ratio (CLR) transformation.By replacing the l 2 loss with a robust loss function, we propose a more robust approach for DAA.

CLR-Based Log-Linear Models
Let X is and Y is denote the absolute abundance and observed read count of the i-th taxon in the s-th sample, respectively.For the s-th sample, the total read count of all taxon, N s = ∑ m i=1 Y is , is determined by the sequencing depth and DNA materials.Given N s , it is natural to model the stratified count data over m taxon using a multinomial distribution as Under (1), we have log where the ratio on the left side represents the sample proportion, while the ratio on the right side represents the population proportion, and e is denotes the estimation error.The sample proportion is the maximum likelihood estimator (MLE) of the population proportion based on the multinomial model (1).By the consistency of the MLE, e is is expected to diminish as N s increases.We consider the log linear model on the absolute abundance where c s = (1, c s1 , . . ., c sd ) includes intercept and the d-dimensional covariates to be adjusted, u s is the variable of interest, and is is the error term.Here, we assume a log-linear model on the absolute abundance, which is a reasonable and widely adopted approach in abundance data analysis [24,25].The objective here is to identify taxa that show differential abundance relative to u s .To this end, we simultaneously test the following m hypotheses: Set ε is = is + e is .Under (2) and (3), the CLR data satisfy the following linear model:

M-Estimation Framework for Differential Analysis
Define ᾱi = α 0i − ᾱ and βi = β 0i − β.We propose to estimate ᾱi and βi by solving the following M-estimation problem: where L is a loss function chosen by the practitioners and the resulting estimators are referred to as M-estimators.When L is the l 2 loss, i.e., L(r) = r 2 , (α i , βi ) become the ordinary least squares (OLS) estimators, which has been considered in [18].
Example 1.To handle heavy-tailedness and outliers in datasets, we can employ the robust loss function in (4), which is commonly used in robust regression.Robust regression is designed to estimate the parameters of a regression model in the presence of influential observations that can distort the results.Unlike ordinary least squares regression, which assumes normally distributed errors with constant variance, robust regression offers more flexible assumptions about the error distribution and is less sensitive to outliers.It becomes particularly advantageous when data contain outliers that cannot be readily removed or explained, or when ensuring that the regression coefficients are not overly influenced by a few observations.Below are some popular robust regression loss functions. 1.
Huber's loss is defined as , where c is the hyperparameter and the default value is 1.345, which is widely used in robust regression studies.Notice that the Huber estimator down-weights the influence of observations with large residuals, resulting in less impact on the estimated regression coefficients.Additionally, Huber [22] argued that if the true distribution was normal, this loss function is asymptotically 95% as efficient as least squares.

2.
Tukey's bisquare loss is defined as where c 0 = 4.685 is the standard constant for this loss function.It is worth noting that this function has been shown to possess an asymptotic efficiency of 95% with respect to linear regression for the normal distribution.

3.
Quantile regression loss is defined as where τ represents the quantile level of interest.Notably, when τ = 1 2 , the loss function is equivalent to the l 1 loss, expressed as l 1 (r) = |r|.The l 1 loss corresponds to the loss function utilized in the least absolute deviations (LAD) regression method.
Let z s = (u s , c s ) and θ i = (α i , β i ) .We define θi (and θi ) in the same way as θ i by replacing α i with αi (and ᾱi ), and β i with βi (and βi ), respectively.Denote r is = W is − z s θ i , and define ris and ris analogously by replacing θ i with θi and θi , respectively.We can then rewrite each summand of the objective function in (4) as L(r is where ψ is referred to as the influence curve [26].Under certain regularity conditions (see, for example, Theorem 5.21 in Van der Vaart [27]), the asymptotic normality of the M-estimator is given as follows [26]: where In practice, we will not know the true distribution of the error term and the true regression parameters θi .Therefore, we propose using the plug-in method to estimate the asymptotic variance of θi (rather than √ n θi ), which is given by When L is the l 2 loss, the estimators used in LinDA and ( 7) are asymptotically equivalent.
From the above discussions, αi obtained by minimizing ( 4) is an asymptotically unbiased estimator for α i − ᾱ.To estimate α i , it remains to find an estimator for ᾱ.To this end, we adopt the mode correction method proposed by Zhou et al. [18].Specifically, assuming that only a small portion of taxa exhibit differential abundance, meaning that most values of α i are equal to 0. Under this assumption, the mode of αi is expected to be close to −ᾱ.Hence, we can estimate ᾱ through estimating the mode of {α i : 1 ≤ i ≤ m}.Specifically, we use the kernel smoothing approach to estimate the mode and let where K is a kernel function satisfying ∞ −∞ K(x) dx = 1 and h is the bandwidth.Then, the resulting bias-corrected estimator of α i is given by αi = αi + α.In our implementation, we use mlv function in R package modeest to estimate the mode.
Once the bias-corrected estimator is obtained, we implement the t-test by defining the test statistic as T i = αi / σi , where σ2 i is the variance estimator of αi in the regression problem, corresponding to the (1, 1)th entry of Σi defined in (7).Although we have the asymptotic normality of the M-estimator, our simulations revealed that the t-distribution provides a better approximation for the sampling distribution of T i and offers better FDR control for small samples.Consequently, we propose using the t-test in our method.The p-value is then calculated as 2P(T ≥ |T i |), where T follows a t-distribution with n − d − 2 degrees of freedom.To control the FDR, we recommend using the Benjamini-Hochberg (BH) procedure to adjust the p-values obtained for each taxon.The taxa with an adjusted p-value less than a certain threshold are referred to as differential taxa.
We summarize our procedure as follows: 1.
For each taxon, solve the optimization problem defined in (4) to obtain the estimator αi and its variance estimator σ2 i .

2.
Calculate the mode based on {α i : 1 ≤ i ≤ m}, and perform mode correction to obtain the bias-corrected estimator αi .

3.
Compute the test statistics T i and the p-value for each taxon.

4.
Apply the BH procedure to adjust the p-values.

Remark 1.
Given the practical challenge of determining the optimal hyperparameter to use in the loss function (e.g., c in the Huber loss function), we propose utilizing the Cauchy combination rule [28] for aggregating the p-values obtained from different hyperparameters.This approach addresses the difficulty of selecting the most suitable hyperparameter by combining the results from multiple options.
Remark 2. Based on the regression-based framework, our method can be readily extended to apply to the mixed-effect model.Please refer to Appendix B for more details.

Winsorization
Winsorization is a statistical data preprocessing technique utilized for handling outliers and heavy-tailed data.This method involves sorting the dataset in either ascending or descending order based on the analytical requirements.Next, the extreme values, namely outliers or values in the tails of the distribution, are substituted with the smallest or largest non-outlier value, correspondingly.Moreover, winsorization can be complemented with DAA techniques to enhance the precision and robustness of statistical analysis, particularly when confronted with datasets containing outliers or heavy-tailed distributions.
In microbiome data analysis, winsorization typically entails replacing the top 1 − τ largest values with the value at the τ quantile [20].Specifically, for a given taxon i, Y is was arranged in increasing order, and the τ quantile of {Y is } n s=1 was computed and denoted as q i (τ).We replace the observed count Y is by its winsorized value defined as While winsorization is known to reduce the influence of outliers, it remains unclear whether this method leads to the loss of essential data information.Furthermore, despite its widespread use, the impact of winsorization on DAA methods has not been thoroughly investigated.Consequently, this study seeks to comprehensively examine the impact of multiple levels of winsorization on DAA methods in various scenarios, both with and without outliers/heavy-tailedness, through several numerical examples.

Results
To comprehensively evaluate the performance of different M-estimation-based methods and winsorization at different levels, we conducted extensive simulations under various scenarios.Before describing the simulation settings, we provide a list of the methods we compared.
We evaluated six methods in our comparative analysis, namely LinDA [18] without winsorization (referred to as LinDA), LinDA with winsorization at the 97% quantile (referred to as LinDA97), LinDA with winsorization at the 90% quantile (referred to as LinDA90), M-estimation method with Huber's loss (referred to as Huber), M-estimation method with Tukey's bisquare loss (referred to as Bi_square), and quantile regression method (referred to as QR).To implement the Huber and bisquare methods, we used the rlm function in the MASS package (version: 7.3-60) in R to perform the regression estimation.For the selection of hyperparameters, we considered 10 values of c equally spaced within the interval [1.345, 5] on a log scale for the Huber method.For the Bi_square method, we took 10 values of c 0 equally spaced within the interval [4.685, 20] on a log scale.We used the rq function in the quantreg package (version: 5.95) in R to implement quantile regression.Similar to the Huber method, we took into account the quantile level τ that is evenly distributed across the interval [0.25, 0.75] with an adjacent difference of 0.05 (resulting in a total of 11 values).Remark 3. We also compared with other differential abundance analysis methods, including ALDEx2 [15], ANCOM-BC [16], and MaAsLin2 [17], and the results are deferred to Appendix C. Based on our simulation, we found that LinDA outperforms other methods that do not consider outliers and heavy-tailedness.Hence, we used LinDA as the benchmark method in the following and focus on comparing various log-linear model-based approaches used for addressing outliers.
To handle zero values, we employed the hybrid method proposed by Zhou et al. [18], which combines two different approaches.The first approach involves adding a pseudocount of 0.5 to all counts, which is a commonly used technique in microbiome data analysis on the log scale.The second approach is the imputation-based method, which involves replacing the zeros with fractions equal to N s / max{N k : Y ik = 0} for the i-th taxon in the s-th sample, where larger fractions are used for samples with larger library sizes.Zhou et al. [18] used a statistical test to determine which method to apply.Specifically, they tested the association between the covariate of interest and the library size based on the loglinear model.If the p-value was less than 0.1, they used the imputation approach; otherwise, they used the pseudo-count approach.More details can be found in Zhou et al. [18].

Simulations Based on Log-Linear Models
In this section, we simulated datasets from log-linear models.We adopted the datagenerating process proposed by Zhou et al. [18], while using different methods to generate error terms.We assumed that the baseline absolute abundance and the true proportion is obtained as We denote the signal strength as µ.In order to construct a power curve, we included six signal strengths in the figures, which are evenly spaced within the interval [1.05, 2].Given that low-abundance taxa exhibit lower statistical power, we assigned greater weight to their effects to prevent dominance by the abundant ones.Specifically, for the i-th taxon, we set for n = 50 and for n = 200.We randomly selected the differential taxon from the entire set and denoted γ i as an indicator of whether the taxon is differentially abundant (γ i = 1) or not (γ i = 0).The underlying truth of γ i was generated from a Bernoulli distribution with parameter p γ , where we set p γ = 0.05 or p γ = 0.2 to correspond to sparse and dense signal settings, respectively.We then denoted the true signal strength for the differentially abundant taxon by α i = µ i γ i .
To generate the absolute abundance X is , we considered two cases: with or without confounders, and three types of error terms.We defined is as the error term corresponding to the s-th sample of the i-th taxon, which we will define later.The absolute abundance X is was then obtained by log(X is ) = where β i0 is the intercept term and if there is no confounder and if there are confounders.Here, the confounders c s = (c s1 , c s2 ) ∈ R n×2 , with c s1 ∼ Bernoulli(0.5)and c s2 ∼ N (0, 1).The taxon-specific coefficients β i1 and β i2 are independently generated from the normal distributions with means of 1 and 2 and variances of 1, respectively.Then, the observed operational taxonomic unit (OTU) data were generated from where π is = X is / ∑ m j=1 X js .Moreover, the parameters β * i , σ 2 i * , and N s used in our study are identical to those used in Zhou et al. [18].Specifically, Zhou et al. [18] estimated the aforementioned parameters based on a real dataset (COMBO) that studied the gut microbiota in a general population.
We begin by examining the performance of the methods in the absence of heavytailedness and outliers.We aim to assess how much power is lost when using robust regression/winsorization.For this purpose, we sampled errors from a normal distribution with a mean of 0 and a variance of σ 2 i * .The results are presented in Appendix A.1.We consider sample sizes of n ∈ {50, 200}, referred to as the small sample case and the large sample case, respectively.We set the number of taxa as m = 500.In the small sample size scenario, LinDA97 shows the best performance in the sparse-signal setting, whereas LinDA outperforms other methods in the dense-signal setting.For larger sample sizes, LinDA, LinDA97, Huber, and Bi_square methods all perform similarly and outperform LinDA90.The QR method has the lowest power in all scenarios and this method has FDR inflation when n = 200 in the sparse-signal setting.

Heavy-Tailedness Setting
To demonstrate the advantages of our method in handling heavy-tailedness, instead of assuming a standard normal distribution for is , we generate error terms using three different ways: • Case 1: Student's t-distribution with degrees of freedom 3. • Case 2: Log-normal distribution with log mean parameter of 0 and log standard deviation parameter of 0.8.We recentered the samples so that it has a zero mean.• Case 3: Weibull distribution with shape parameter of 0.5 and scale parameter of 0.3.We recentered the samples so that it has a zero mean.
These three are all heavy-tailed distributions, also discussed in Fan et al. [29].It is important to underscore the significance of choosing appropriate parameters for generating error terms.For example, when the shape parameter of the Weibull distribution is small (e.g., 0.25), the resulting data can become excessively noisy, causing all methods to have no power to detect differential taxa.On the other hand, when the shape parameter of the Weibull distribution exceeds 1 (meaning the Weibull distribution is no longer heavy-tailed), the noise level diminishes significantly.As a result, all methods tend to achieve a power nearing 1.
We consider two sample size scenarios: n ∈ {50, 200}, denoted as the small and large sample cases, respectively.Additionally, we fix the number of taxa at m = 500.We conducted 100 simulation runs for each setting, calculated the mean power to detect differential taxa, and computed the empirical FDR.The results are presented in plots.This section showcases results for settings without confounders in the sparse scenario (p γ = 0.05).Similar phenomena were observed in both the dense scenario and the case with confounders.These are detailed in Appendices A.2 and A.3, respectively.
Figure 1 presents the results obtained when error terms are generated from a tdistribution (Case 1).All methods show an increased power to detect differential taxa as the sample size grows.The Huber, Bi_square, and LinDA90 methods perform comparably for small sample sizes, each showing higher power than the LinDA97 and LinDA methods.
However, as the sample size grows, the performances of the LinDA90, LinDA97, Huber, and Bi_square methods converge, all surpassing the power of the LinDA method.Notably, the QR method consistently exhibits the lowest power and suffers from significant FDR inflation.The results for error terms generated from a Log-normal distribution (Case 2) are presented in Figure 2. Like the previous case, all methods exhibit an increased power to detect differential taxa with a larger sample size.In both small and large sample size scenarios, the Huber and Bi_square methods lead in power.LinDA90 follows, outperforming LinDA97.Notably, when the sample size is small, LinDA90 has an FDR inflation.Upon closer inspection, the Huber method outperforms the Bi_square method regarding power and FDR control.Although the QR method has higher power than LinDA when n = 200, it comes at the cost of considerable FDR inflation.The outcomes for the setting in which error terms are sampled from a Weibull distribution (Case 3) are shown in Figure 3.As the sample size increases, all methods exhibit enhanced power in detecting differential taxa.When the sample size is small, all methods display limited power for detecting differential taxa, with a power of approximately 0.1.In contrast, with larger sample sizes, the Huber method outperforms all other methods, followed by the Bi_square and LinDA97 methods.Notably, Bi_square shows FDR inflation when the sample size is small, whereas the LinDA90 method exhibits significant FDR inflation with large sample sizes.Since the FDR of the QR method exceeds 0.3 when n = 200, it was excluded from the plot.

With Outliers Setting
We employ the following procedure to generate the data to demonstrate the impact of the number of outliers on the power of the DAA method.The number of taxa is fixed at m = 500, and the number of samples is set to n = 100.Following the datageneration process discussed in Section 3.1, we begin by sampling error terms from a normal distribution with a mean of 0 and a variance of σ 2 i * for the i-th taxon.The outliers are generated by randomly selecting a subset of nonzero counts and multiplying them by a fold change of 20.More specifically, we randomly choose 250, 500, 1000, and 2000 nonzero counts, corresponding to an average of 0.5, 1, 2, and 4 outliers per taxon, respectively.
Let ρ denote the average number of outliers per taxon.The results for the case when outliers exist without confounders are presented in Figure 4.As the number of outliers increases, the power of all methods will decrease.When the number of outliers is small (ρ = 0.05), LinDA97 exhibits slightly better performance across all methods.However, as the number of outliers increases, the performance of Huber and LinDA90 improves.Specifically, when ρ = 4, LinDA90 demonstrates the highest power for small signal strengths, while Huber exhibits the highest power for large signal strengths.

Real Data Analysis
We utilized three real datasets, including independent samples from studies on C. difficile infection (CDI) [30], inflammatory bowel disease (IBD) [31], and rheumatoid arthritis (RA) [12].The CDI and RA datasets were downloaded from the links provided in the original paper, while the IBD dataset was obtained from the Qiita database [32] using study IDs 1460 and 524.To ensure data quality, we excluded samples with less than 1000 read counts and taxa that appeared in less than 10% of the samples for each dataset.The variable that we are interested in is binary phenotypes across all datasets.In the case of the IBD dataset, the confounding factor is the usage of antibiotics.
First, we compared the detection power of all methods discussed in Section 3 across the three datasets.Figure 5 presents the number of discoveries at various FDR levels (0.01-0.25).For the CDI dataset, the LinDA97, LinDA, and Huber methods exhibited the highest number of discoveries at an FDR level of 0.1.As the FDR level increased, the Bi_square method identified the most discoveries.For the IBD dataset, all methods except QR had a similar number of discoveries at an FDR level of 0.1.However, the Bi_square and Huber methods outperformed others regarding discoveries at different FDR levels in the overall analysis.Conversely, the LinDA and Huber methods consistently identified the most discoveries across all FDR levels in the RA dataset.Therefore, the Huber method displayed superior discovery capability overall.It is worth mentioning that the QR method consistently yielded the fewest discoveries in all three datasets, which is consistent with our simulation results.Subsequently, to illustrate the overlapping discoveries among different methods, we employed the UpSet plot [33] to depict the overlap at the target FDR level of 0.1.Figure 6 presents the overlap of differentially abundant taxa across the three real datasets at this FDR level.It is evident that, in most cases, the taxa identified by the Huber method are consistently identified by other methods as well.This consistency implies that the taxa identified by the Huber method are more likely to be "truly" differentially abundant.Conversely, the remaining methods exhibit independent findings, lacking support from other methods, suggesting a higher likelihood of false discoveries.Consequently, the Huber method demonstrates greater robustness in practical applications.As an illustration, we present a boxplot of the taxa expression after applying the CLR transformation, focusing solely on taxa identified only by the LinDA method.Figure 7 displays an example of a taxa expression after CLR, identified only by the LinDA method in the RA dataset.The left panel shows the original data, while the right panel represents the data with potential outliers removed.In this case, we remove samples whose taxa expression after CLR exceeds 1.5.After removing these samples that significantly deviate from the rest, we observe similar means between the two groups.However, LinDA identifies this taxon as differentially abundant due to outliers, which may be a false discovery.

Conclusions and Discussion
This study investigates the influence of heavy-tailedness and outliers on differential abundance analysis using different methods and proposes effective strategies to address them.The presence of heavy-tailedness and outliers can substantially reduce the power of existing DAA methods.To resolve this issue, we propose two techniques to enhance the robustness of DAA methods.First, we introduce a general regression framework for DAA by extending the LinDA method.This framework includes differential analysis based on Huber regression as a special case, which is more stable in handling outliers.Second, we propose the winsorization method, which involves winsorizing the data before applying DAA methods.
Our findings from both simulation and real data indicate that, among the M-estimationbased approaches, the Huber method, in particular, demonstrates more robust performance in handling heavy-tailedness and the presence of outliers in the dataset.Specifically, compared to LinDA with winsorization, the Huber method exhibits higher power while maintaining FDR control.Furthermore, the Huber method demonstrates better FDR control than the Bi_square method.Consequently, we recommend utilizing the Huber method as an effective approach to address outliers and heavy-tailedness.Additionally, the optimal quantile level for winsorization remains unclear.Therefore, another advantage of the Huber method is that it does not require the selection of such hyperparameters.

Appendix B. Mixed-Effects Models
Based on the regression-based framework, our method can be readily extended to apply to the mixed-effect model.Consider the linear mixed-effect model: where γ i represents the random effect and v s represents the corresponding design matrix.Mixed-effects analysis can be used to analyze correlated microbiome data from studies in-volving replicates or spatial sampling, as well as family-based and longitudinal microbiome studies.Similarly, we need to address the regression problem: where any method for dealing with linear mixed effects can be applied to obtain the estimate here.
For robustly fitting the linear mixed-effects models, we utilize the rlmer function provided by the R package robustlmm (version: 3.2-0) [34], which is a Huber loss-based method.Furthermore, we use the get_Lb_ddf function available in the R package pbkrtest (version: 0.5.2) to compute the degrees of freedom using the Kenward-Roger approximation [35].Due to the considerable computational time required for fitting a robust linear mixed-effects model, we opted to select specific hyperparameters.Following the recommendation by Koller [34], we chose c = 1.345 and c = 2.28.Subsequently, we employed the Cauchy method to combine the p-values.
To evaluate the performance of our proposed method, we employed the data generation process proposed by Zhou et al. [18] but used different methods to generate the random effect terms and the error terms.Consider the scenario involved in replicate sampling, where we generated 50 subjects with 4 replicates, for n = 200.The absolute abundance X is was generated by log(X is ) = where ω is is the random effect and is is the error term.

Appendix B.1. Heavy-Tailedness Setting
In the heavy-tailedness setting, the random effect terms and error terms of the mixedeffects model were evaluated in two different settings, corresponding to Case 2 and 3, as described in Section 3.1.

1.
Case1: We sampled ω is and is from Log-normal distribution with log mean parameter of 0 for both terms and log standard deviation parameter of 0.5 and 0.8, respectively.We recentered the samples so that it has a zero mean.

2.
Case2: We sampled ω is and is from a Weibull distribution with scale parameter of 0.3 for both terms and shape parameter of 0.8 and 0.5, respectively.We recentered the samples so that it has a zero mean.
For all three cases, the observed OTUs were generated following the same method as the log-linear model described in Section 3.1.Both sparse-signal (p γ = 0.05) and dense-signal (p γ = 0.2) settings were considered.
Figure A8 illustrates the outcomes of the mixed-effects model, with random effects and errors generated from a Log-normal distribution.In both the sparse and dense-signal settings, the Huber method displays the highest power, while maintaining absence of FDR inflation.In the sparse-signal setting, the power of LinDA90 outperforms that of LinDA97, while both methods exhibit comparable performance in the dense-signal setting.Figure A9 gives the results of the mixed-effects model, where the random effects and errors were generated from a Weibull distribution.In scenarios with low signal strength, LinDA97 demonstrates the best performance, while Huber surpasses all methods as the signal strength increases.Notably, LinDA90 is the only method exhibiting FDR inflation in this particular setting.

Appendix C. Additional Numerical Results for Comparison with Other Methods
In this section, we compared our method with other differential abundance analysis methods, including ALDEx2 [15], ANCOM-BC [16], and MaAsLin2 [17].We focus on the simulation settings based on log-linear models we present in Section 3.1.

Figure 1 .
Figure 1.The log-linear model yielded results in the absence of confounding variables in the sparsesignal setting (p γ = 0.05), where errors were generated from a t-distribution.The left panel represents a sample size of n = 50 and the right panel corresponds to a sample size of n = 200.

Figure 2 .
Figure 2. The log-linear model yielded results in the absence of confounding variables in the sparsesignal setting (p γ = 0.05), where errors were generated from a Log-normal distribution.The left panel represents a sample size of n = 50 and the right panel corresponds to a sample size of n = 200.

Figure 3 .
Figure 3.The log-linear model yielded results in the absence of confounding variables in the sparsesignal setting (p γ = 0.05), where errors were generated from a Weibull distribution.The left panel represents a sample size of n = 50 and the right panel corresponds to a sample size of n = 200.

Figure 4 .
Figure 4.The log-linear model yielded results in the absence of confounding variables in the sparsesignal setting (p γ = 0.05) with outliers.ρ is the average number of outliers per taxon.

Figure 5 .
Figure 5. Number of discoveries with respect to different FDR levels (0.01-0.25) for three real datasets.

Figure 6 .
Figure 6.The overlap of differentially abundant taxa detected by various DAA methods across three real datasets at an FDR level of 0.1.

Figure 7 .
Figure 7.An example of a taxa expression after CLR.The taxa are only identified by the LinDA method in the RA dataset.The left panel shows the original data.The right panel represents the data with potential outliers removed.

Figure A1 .Figure A2 .Figure A3 .Figure A4 .Figure A5 .Figure A6 .Figure A7 .
Figure A1.Results for the log-linear model with errors generated from a normal distribution without confounders.The left column corresponds to a sample size of n = 50 and the right column corresponds to a sample size of n = 200.The first group of figures shows results in the sparse-signal setting and the second group of figures shows results in the dense-signal setting.Appendix A.2. Numerical Results for Dense-Signal SettingFigureA2shows the results for error terms generated from t-distribution; FigureA3shows the results for error terms generated from Weibull distribution; FigureA4shows the results for the case with outliers.

Figure A8 .
Figure A8.Results for the mixed-effects model where random effects and errors were generated from a Log-normal distribution.The left panel represents the sparse-signal setting and the right panel corresponds to dense-signal setting.

Figure A9 .
Figure A9.Results for the mixed-effects model where random effects and errors were generated from a Weibull distribution.The left panel represents the sparse-signal setting and the right panel corresponds to dense-signal setting.

Figure A12 .Figure A13 .Figure A14 .Figure A15 .
Figure A12.The log-linear model yielded results in the absence of confounding variables in the sparse-signal setting (p γ = 0.05), where errors were generated from a t-distribution.The left panel represents a sample size of n = 50 and the right panel corresponds to a sample size of n = 200.